# High-Precision Visual Question Answering
Videorefer 7B Stage2.5
Apache-2.0
VideoRefer-7B is a multimodal model built on a video large language model and focused on spatio-temporal object understanding.
Text-to-Video
Transformers English

DAMO-NLP-SG
20
2
Llama 3.2V 11B Cot
Apache-2.0
Llama-3.2V-11B-cot is a vision-language model capable of spontaneous, systematic reasoning, built on the LLaVA-CoT framework.
Image-to-Text
Transformers English

Xkev
5,089
151
Xgen Mm Phi3 Mini Instruct Singleimg R V1.5
Apache-2.0
xGen-MM is a series of foundational large multimodal models developed by Salesforce AI Research. It builds on the design of the BLIP series and offers stronger multimodal processing capabilities.
Image-to-Text
Safetensors English
Salesforce
313
15
Internlm Xcomposer2 Vl 7b
Other
InternLM-XComposer2 is a large vision-language model built on InternLM2, with strong image-text understanding and composition capabilities.
Text-to-Image
Transformers

internlm
1,902
82